Optimal Stem Identification in Presence of Suffix List

Authors

  • N. Vasudevan
  • Pushpak Bhattacharyya
Abstract

Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task. Stemming words with the help of a language's suffix list as linguistic information is, in comparison, an easier problem. In this work we treated stemming as the process of obtaining a minimum-size lexicon from an unannotated corpus by using a suffix set. We proved that the exact lexicon reduction problem is NP-hard and came up with a polynomial-time approximation. We also proposed a probabilistic model for stemming that minimizes the stem distributional entropy. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
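To make the lexicon-minimization view concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it assumes a standard greedy set-cover heuristic as one plausible polynomial-time approximation, and it uses English toy words rather than Malayalam data. The function names (`candidate_stems`, `greedy_stem_cover`, `stem_entropy`) are hypothetical, and the entropy helper only measures the entropy of a given stem assignment; it does not implement the probabilistic model mentioned in the abstract.

```python
from collections import Counter, defaultdict
from math import log2


def candidate_stems(word, suffixes):
    """Return the word itself plus every stem left after stripping a known suffix."""
    stems = {word}  # empty suffix: a word may be its own stem
    for suf in suffixes:
        if suf and word.endswith(suf) and len(word) > len(suf):
            stems.add(word[: -len(suf)])
    return stems


def greedy_stem_cover(words, suffixes):
    """Assumed heuristic: repeatedly pick the stem that covers the most unassigned words."""
    covers = defaultdict(set)
    for w in set(words):
        for s in candidate_stems(w, suffixes):
            covers[s].add(w)

    uncovered = set(words)
    assignment = {}
    while uncovered:
        # Ties are broken by stem length, then alphabetically, to keep runs deterministic.
        best = max(covers, key=lambda s: (len(covers[s] & uncovered), len(s), s))
        for w in covers[best] & uncovered:
            assignment[w] = best
        uncovered -= covers[best]
    return assignment


def stem_entropy(assignment):
    """Shannon entropy (in bits) of the stem distribution over distinct word types."""
    counts = Counter(assignment.values())
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    toy_words = ["play", "plays", "played", "playing", "walk", "walked"]
    toy_suffixes = {"s", "ed", "ing"}
    result = greedy_stem_cover(toy_words, toy_suffixes)
    print(sorted(set(result.values())))
    print(round(stem_entropy(result), 4))
```

On the toy input, the six word forms collapse onto two stems, which illustrates how a suffix list lets the lexicon shrink without any annotated data.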


Similar resources

A Framework for Learning Morphology using Suffix Association Matrix

Unsupervised learning of morphology is used for automatic affix identification, morphological segmentation of words and generating paradigms which give a list of all affixes that can be combined with a list of stems. Various unsupervised approaches are used to segment words into stem and suffix. Most unsupervised methods used to learn morphology assume that suffixes occur frequently in a corpus...


Little by Little: Semi Supervised Stemming through Stem Set Minimization

In this paper we take an important step towards completely unsupervised stemming by giving a scheme for semi supervised stemming. The input to the system is a list of word forms and suffixes. The motivation of the work comes from the need to create a root or stem identifier for a language that has electronic corpora and some elementary linguistic work in the form of, say, suffix list. The scope...


Phonetic Reflexes of Morphological Boundaries at a Normal Speech Rate

Our production experiment in Scottish English revealed that the duration of a rhyme immediately followed by a Level II suffix such as –s (the 1 person singular/plural/possessive suffix) and –t (the past tense suffix) was significantly longer than that of a monomorphemic counterpart. Such a durational difference between suffixed forms and monomorphemic forms was absent when the Level II suffix w...


Learning Word Segmentation Rules for Tag Prediction

In our previous work we introduced a hybrid, GA&ILP-based approach for learning of stem-suffix segmentation rules from an unmarked list of words. Evaluation of the method was made difficult by the lack of word corpora annotated with their morphological segmentation. Here the hybrid approach is evaluated indirectly, on the task of tag prediction. A pair of stem-tag and suffix-tag lexicons is obt...


Pharmacognostic study of Argyreia pilosa stem

Background and objectives: Argyreia pilosa (Convolvulaceae) has been used ethnomedicinally in the conventional system for many ailments, most significantly against sexually transmitted diseases, skin troubles, diabetes, rheumatism, cough, and quinsy. The key challenge experienced in the standardization of herbal drugs is the correct identification of the plant sour...



Publication date: 2012